imagem

Summary

The presented project aims to develop a model to evaluate if a costumer has high probability of living the company. This is of extreme importance because it will allow to identify and, for example, offer a better plan that better satisfies the costumer in order to keep it as a client.

15 % of the costumers presented in the data set are churn.

At the end with a classification model (XGBoost) we were able to estimate with 98% is satisfied or not with the current tele company.

Customer Churn Prediction 2020¶

This competition is about predicting whether a customer will change telecommunications provider, something known as churning.

The training dataset contains 4250 samples. Each sample contains 19 features and 1 boolean variable churn which indicates the class of the sample. The 19 input features and 1 target variable are:

  • state, string. 2-letter code of the US state of customer residence
  • account_length, numerical. Number of months the customer has been with the current telco provider
  • area_code , string=area_code_AAA where AAA = 3 digit area code.
  • international_plan, (yes/no). The customer has international plan.
  • voice_mail_plan, (yes/no). The customer has voice mail plan.
  • number_vmail_messages, numerical. Number of voice-mail messages.
  • total_day_minutes, numerical. Total minutes of day calls.
  • total_day_calls, numerical. Total minutes of day calls.
  • total_day_charge, numerical. Total charge of day calls.
  • total_eve_minutes, numerical. Total minutes of evening calls.
  • total_eve_calls, numerical. Total number of evening calls.
  • total_eve_charge, numerical. Total charge of evening calls.
  • total_night_minutes, numerical. Total minutes of night calls.
  • total_night_calls, numerical. Total number of night calls.
  • total_night_charge, numerical. Total charge of night calls.
  • total_intl_minutes, numerical. Total minutes of international calls.
  • total_intl_calls, numerical. Total number of international calls.
  • total_intl_charge, numerical. Total charge of international calls
  • number_customer_service_calls, numerical. Number of calls to customer service
  • churn, (yes/no). Customer churn - target variable.
In [1]:
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

import pandas as pd
import numpy as np

# Plotting
import matplotlib
from matplotlib import transforms, pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from scipy.stats import shapiro

# Train model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC
In [2]:
# define colors
GRAY1, GRAY2, GRAY3 = '#231F20', '#414040', '#555655'
GRAY4, GRAY5, GRAY6 = '#646369', '#76787B', '#828282'
GRAY7, GRAY8, GRAY9 = '#929497', '#A6A6A5', '#BFBEBE'
BLUE1, BLUE2, BLUE3, BLUE4 = '#174A7E', '#4A81BF', '#94B2D7', '#94AFC5'
RED1, RED2 = '#C3514E', '#E6BAB7'
GREEN1, GREEN2 = '#0C8040', '#9ABB59'
ORANGE1 = '#F79747'

Import data¶

In [3]:
data = pd.read_csv('data/train.csv', delimiter=',')
data.head()
Out[3]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churn
0 OH 107 area_code_415 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 no
1 NJ 137 area_code_415 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 no
2 OH 84 area_code_408 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 no
3 OK 75 area_code_415 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 no
4 MA 121 area_code_510 no yes 24 218.2 88 37.09 348.5 108 29.62 212.6 118 9.57 7.5 7 2.03 3 no

The train data has 4250 rows and 20 variables, being churn a binary type which we want to predict on test data.

Data QC (NA and null values)¶

NA values in data?

In [4]:
data.isna().sum()
Out[4]:
state                            0
account_length                   0
area_code                        0
international_plan               0
voice_mail_plan                  0
number_vmail_messages            0
total_day_minutes                0
total_day_calls                  0
total_day_charge                 0
total_eve_minutes                0
total_eve_calls                  0
total_eve_charge                 0
total_night_minutes              0
total_night_calls                0
total_night_charge               0
total_intl_minutes               0
total_intl_calls                 0
total_intl_charge                0
number_customer_service_calls    0
churn                            0
dtype: int64

Null values in data?

In [5]:
data.isnull().sum()
Out[5]:
state                            0
account_length                   0
area_code                        0
international_plan               0
voice_mail_plan                  0
number_vmail_messages            0
total_day_minutes                0
total_day_calls                  0
total_day_charge                 0
total_eve_minutes                0
total_eve_calls                  0
total_eve_charge                 0
total_night_minutes              0
total_night_calls                0
total_night_charge               0
total_intl_minutes               0
total_intl_calls                 0
total_intl_charge                0
number_customer_service_calls    0
churn                            0
dtype: int64

Duplicates samples?

In [6]:
data.duplicated().sum()
Out[6]:
0

There are no NA, null value or duplicated entries.

Split into train, validation and test¶

Let split the data into train (80%), validation (10%) and test (10%) in order to avoid leaked information into the model. Because we don't have a high quantity of data we will keep 80% for the training.

In [7]:
X = data.drop('churn', axis = 1)
y = data.churn
In [8]:
X_train, X_testval, y_train, y_testval = train_test_split(X, y, 
                                                          test_size = 0.2,
                                                         shuffle= True)

X_test, X_val, y_test, y_val = train_test_split(X_testval, y_testval, 
                                                test_size= 0.5,
                                               shuffle= True)

print(f'Train shape: {X_train.shape}')
print(f'Validation shape: {X_val.shape}')
print(f'Test shape: {X_test.shape}')

print(f'Sum of rows is ok: {X_train.shape[0] + X_val.shape[0] + X_test.shape[0] == data.shape[0] }')
Train shape: (3400, 19)
Validation shape: (425, 19)
Test shape: (425, 19)
Sum of rows is ok: True

We will have 3400 entries to train the model, 425 to optimize the model by tuning hyper parameters and 425 to make the final evaluation of model performance.

Data munging¶

In [9]:
X_train.head()
Out[9]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls
4056 MI 62 area_code_408 no no 0 119.3 94 20.28 224.0 81 19.04 156.7 78 7.05 12.1 6 3.27 5
4017 CO 97 area_code_415 no yes 27 125.9 90 21.40 123.4 91 10.49 230.0 71 10.35 7.3 7 1.97 0
2819 VT 60 area_code_415 no no 0 193.9 118 32.96 85.0 110 7.23 210.1 134 9.45 13.2 8 3.56 3
1069 MT 74 area_code_415 no no 0 162.7 102 27.66 292.0 105 24.82 183.3 80 8.25 8.7 6 2.35 0
344 AZ 117 area_code_408 no no 0 239.9 84 40.78 174.8 106 14.86 209.5 93 9.43 9.8 2 2.65 0

First we will simplify the area_code variable to leave only the 3 digits.

In [10]:
X_train2 = X_train.copy()
X_train2['area_code'] =X_train['area_code'].apply(lambda x: x.split('_')[-1])
In [11]:
X_train2.head()
Out[11]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls
4056 MI 62 408 no no 0 119.3 94 20.28 224.0 81 19.04 156.7 78 7.05 12.1 6 3.27 5
4017 CO 97 415 no yes 27 125.9 90 21.40 123.4 91 10.49 230.0 71 10.35 7.3 7 1.97 0
2819 VT 60 415 no no 0 193.9 118 32.96 85.0 110 7.23 210.1 134 9.45 13.2 8 3.56 3
1069 MT 74 415 no no 0 162.7 102 27.66 292.0 105 24.82 183.3 80 8.25 8.7 6 2.35 0
344 AZ 117 408 no no 0 239.9 84 40.78 174.8 106 14.86 209.5 93 9.43 9.8 2 2.65 0

Replace yes and no by 0 and 1, respectively.

In [12]:
X_train3 = X_train2.copy()

X_train3["international_plan"] = X_train2["international_plan"].map(dict(yes=1, no=0))
X_train3["voice_mail_plan"] = X_train2["voice_mail_plan"].map(dict(yes=1, no=0))

X_train3.head()
Out[12]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls
4056 MI 62 408 0 0 0 119.3 94 20.28 224.0 81 19.04 156.7 78 7.05 12.1 6 3.27 5
4017 CO 97 415 0 1 27 125.9 90 21.40 123.4 91 10.49 230.0 71 10.35 7.3 7 1.97 0
2819 VT 60 415 0 0 0 193.9 118 32.96 85.0 110 7.23 210.1 134 9.45 13.2 8 3.56 3
1069 MT 74 415 0 0 0 162.7 102 27.66 292.0 105 24.82 183.3 80 8.25 8.7 6 2.35 0
344 AZ 117 408 0 0 0 239.9 84 40.78 174.8 106 14.86 209.5 93 9.43 9.8 2 2.65 0

Convert objects to category on both samples and label datafrmes.

In [13]:
for col in ['state', 'area_code', 'international_plan', 'voice_mail_plan']:
    X_train3[col] = X_train3[col].astype('category')
In [14]:
y_train = y_train.astype('category')

y_train3 = y_train.copy()

y_train3 = y_train3.map(dict(yes=1, no=0))

Data Visualization¶

Tableau dashboard integration¶

An initial visualization was developed considering the churn percentage at each state, the account length and total charge, The dashboard built on Tableau is interactive and by selecting the state in the tree map the histogram, map and box plot will change. It can be seen that 'no churn' clients normally charged less than the others. This can be an important feature to evaluate if a client is satisfied or not.

In [90]:
%%html
<div class='tableauPlaceholder' id='viz1661427519889' style='position: relative'><noscript><a href='#'><img alt='Dashboard 2 ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Pr&#47;Project4_16562381181040&#47;Dashboard2&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Project4_16562381181040&#47;Dashboard2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Pr&#47;Project4_16562381181040&#47;Dashboard2&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='pt-BR' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1661427519889');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1366px';vizElement.style.height='795px';} else { vizElement.style.width='100%';vizElement.style.height='1277px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

Firstly will we join together samples and targets on the same dataframe in order to compare statistics between the clients that are identified as churn and the ones that are not. Secondly we will build an interactive dashboard on Tableau and bring the main insights to the notebook.

Data visualization using Python¶

In [16]:
df = pd.concat([X_train3, y_train3], axis = 1)
In [17]:
df.head()
Out[17]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge total_night_minutes total_night_calls total_night_charge total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churn
4056 MI 62 408 0 0 0 119.3 94 20.28 224.0 81 19.04 156.7 78 7.05 12.1 6 3.27 5 1
4017 CO 97 415 0 1 27 125.9 90 21.40 123.4 91 10.49 230.0 71 10.35 7.3 7 1.97 0 0
2819 VT 60 415 0 0 0 193.9 118 32.96 85.0 110 7.23 210.1 134 9.45 13.2 8 3.56 3 0
1069 MT 74 415 0 0 0 162.7 102 27.66 292.0 105 24.82 183.3 80 8.25 8.7 6 2.35 0 0
344 AZ 117 408 0 0 0 239.9 84 40.78 174.8 106 14.86 209.5 93 9.43 9.8 2 2.65 0 0

State¶

In [18]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Bar(x = df.state.value_counts().index,
           y = df.state.value_counts().values,
           marker_color = BLUE4,
           opacity = 0.6,
    name = ''),
    row=1, col=1)    
    
fig.add_trace(
    go.Histogram(x=df.state.value_counts(),
                marker_color = BLUE1,
                opacity = 0.6,
    name = ''),
    row=2, col=1
)

# edit axis labels
fig['layout']['xaxis2']['title']='# Samples per state'
fig['layout']['yaxis']['title']='# Samples per state'
fig['layout']['yaxis2']['title']='State count'

# Edit the layout
fig.update_layout(template = 'simple_white',
                  font_family="Arial",
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

fig.update_layout(height=800, width=950, title_text="State Variable")

fig.show()

The overall number of samples for each state is 65, with a range from 50 and 90. The CA state is the state with the fewest number of samples (31), while WV is the one with the largest number of samples on train data (115). Lets see if the percentage of churn clients changes between states.

In [19]:
df_vis = df.copy()
df_vis['churn'] = df.churn.astype('int')

churn_perc = round(df_vis.groupby('state').sum()['churn'] *100 / df_vis.groupby('state').count()['churn'],
                   2 ).sort_values(ascending = False)

pd.DataFrame(churn_perc.describe())
Out[19]:
churn
count 51.000000
mean 14.082157
std 5.194213
min 3.700000
25% 10.440000
50% 13.040000
75% 17.435000
max 25.370000
In [20]:
stat, p = shapiro(churn_perc)
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('Sample looks Gaussian (fail to reject H0)')
else:
    print('Sample does not look Gaussian (reject H0)')
Statistics=0.973, p=0.291
Sample looks Gaussian (fail to reject H0)
In [21]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(
    go.Bar(x = churn_perc.index,
           y = churn_perc.values,
           marker_color = BLUE4,
           opacity = 0.6,
    name = ''),
    row=1, col=1)    
    
fig.add_trace(
    go.Histogram(x=churn_perc,
                marker_color = BLUE1,
                 nbinsx = 15,
                opacity = 0.6,
    name = ''),
    row=2, col=1
)

# edit axis labels
fig['layout']['xaxis2']['title']='percentage (%)'
fig['layout']['yaxis']['title']='percentage (%)'
fig['layout']['yaxis2']['title']='state count'

# Edit the layout
fig.update_layout(template = 'simple_white',
                  font_family="Arial",
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

fig.update_layout(height=800, width=950, title_text="% of churn per state")

fig.show()
  • The churn percentage is not similar within each state, which can tell us that the state can be an important variable to consider in the predictive modeling.
  • The overall mean churn is 14% while the range goes from 4% to almost 30%, almost x8.
  • The best state is HI (Hawaii), while the worst one is NJ (New Jersey).
  • We can also see that the churn percentage follows a normal distribution (based on histogram and shapiro test).

Account Length¶

In [22]:
print(f'The minimum time that a client was in the company was {df.account_length.min()} month.')
print(f'The average time is {round(df.account_length.median()/12, 0)} years / {round(df.account_length.median(), 0)} months')
print(f'The maximum time a client was with the company is {round(df.account_length.max()/12, 0)} years')
The minimum time that a client was in the company was 1 month.
The average time is 8.0 years / 99.0 months
The maximum time a client was with the company is 20.0 years
In [23]:
fig = go.Figure()
# Use x instead of y argument for horizontal plot
fig.add_trace(go.Box(x=df.account_length, name =''))

fig.update_layout(title = 'Account length (months)',
                  height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

fig.show()

Lets compare the account length distribution between churn and no churn clients.

In [24]:
group_labels = ['Yes', 'No']

colors = [GRAY9, BLUE2]

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df[df.churn == 1].account_length, df[df.churn == 0].account_length], group_labels,
                         bin_size= 25, curve_type='normal', # override default 'kde'
                         colors=colors, show_rug=False)

fig['layout']['xaxis']['title']='Account length (months)'

# Add title
fig.update_layout(height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='Churn',
                 showlegend=True)
fig.show()

It can be seen that the account length distribution has no difference between clients with churn and clients without churn. We can see that although a person might be working with the telecom company for a long time, it might not be a reason to avoid the churn.

International and voice mail plan¶

Lets compare the number of clients with international and voice mail plans compare with churn and not churn.

In [25]:
# International Plan

new_index = ['No', 'Yes']

df_1_int = round(df[df.churn == 1]['international_plan'].value_counts() * 100 / df[df.churn == 1]['international_plan'].value_counts().sum() ,0)
df_1_int.index = new_index

df_0_int = round(df[df.churn == 0]['international_plan'].value_counts() * 100 / df[df.churn == 0]['international_plan'].value_counts().sum() ,0)
df_0_int.index = new_index

# Voice mail plan
df_1_voi = round(df[df.churn == 1]['voice_mail_plan'].value_counts() * 100 / df[df.churn == 1]['voice_mail_plan'].value_counts().sum() ,0)
df_1_voi.index = new_index

df_0_voi = round(df[df.churn == 0]['voice_mail_plan'].value_counts() * 100 / df[df.churn == 0]['voice_mail_plan'].value_counts().sum() ,0)
df_0_voi.index = new_index
In [26]:
fig = make_subplots(rows=2, cols=2,
                   subplot_titles=('No Churn - International Plan', 
                                   'Churn - International Plan', 
                                   'No Churn - Voice Mail Plan', 
                                   'Churn - Voice Mail Plan'))

colors = [GRAY9, BLUE1]

fig.add_trace(
    go.Bar(x = df_0_int.index,
           y = df_0_int.values,
           marker_color=colors,
           opacity = 0.6,
           name = 'No Churn - International Plan'),
    row=1, col=1)    
    
    
fig.add_trace(
    go.Bar(x = df_0_voi.index,
           y = df_0_voi.values,
           marker_color=colors,
           opacity = 0.6,
    name = 'Churn - International Plan'),
    row=1, col=2)    
    

fig.add_trace(
    go.Bar(x = df_1_int.index,
           y = df_1_int.values,
           marker_color=colors,
           opacity = 0.6,
    name = 'No Churn - Voice Mail Plan'),
    row=2, col=1)   

fig.add_trace(
    go.Bar(x = df_1_voi.index,
           y = df_1_voi.values,
           marker_color=colors,
           opacity = 0.6,
    name = 'No Churn - Voice Mail  Plan'),
    row=2, col=2)  

# Edit the layout
fig.update_layout(template = 'simple_white',
                  font_family="Arial",
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

fig['layout']['yaxis']['title']='Percentage (%)'
fig['layout']['yaxis3']['title']='Percentage (%)'

fig.update_layout(height=600, width=950)

fig.update_layout(yaxis1 = dict(range=[0, 100]),
                  yaxis2 = dict(range=[0, 100]),
                  yaxis3 = dict(range=[0, 100]),
                  yaxis4 = dict(range=[0, 100]))

fig.show()
  • It can be seen that the behavior between voice mail and international plans for churn and no churn clients is quite different.
  • The percentage no churn clients with international plan is 7%, although from churn clients is almost 30 %. These can mean that the efficient use of an international plan is important to keep the avoid the churn.
  • The opposite behavior is seen for voice mail plans.

Voice mail messages¶

From the customers with voice mail plan, the number of messages in the voice-mail follows a normal distribution with an average of around 30 messages. There is not an evident difference between churn and no-churn clients.

In [27]:
# Create distplot with curve_type set to 'normal'
fig = go.Figure()

df_plot = df[df.voice_mail_plan == 1]

fig.add_trace(
    go.Histogram(x=df_plot[df_plot.churn == 1].number_vmail_messages,
                 histnorm = 'percent',
                marker_color = BLUE1,
                 nbinsx = 15,
                opacity = 0.4,
                 name = 'Yes'))

fig.add_trace(
    go.Histogram(x=df_plot[df_plot.churn == 0].number_vmail_messages,
                 histnorm = 'percent',
                marker_color = ORANGE1,
                 nbinsx = 15,
                opacity = 0.4,
                 name = 'No'))


fig['layout']['xaxis']['title']='Number of voice mail messages'
fig['layout']['yaxis']['title']='Percentage (%)'

# Add title
fig.update_layout(height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='Churn',
                  showlegend=True)

mean_churn = df_plot[df_plot.churn == 1].number_vmail_messages.median()
mean_no_churn = df_plot[df_plot.churn == 0].number_vmail_messages.median()

fig.add_annotation(x=10, y=25,
            text= 'Mean Churn: ' +str(mean_churn),
            showarrow=False)

fig.add_annotation(x=10, y=21,
            text= 'Mean No Churn: ' +str(mean_no_churn),
            showarrow=False)

fig.update_layout(barmode='group', bargap=0.10,bargroupgap=0.0)

fig.show()

Calls characteristics¶

Lets make a singular analysis of the total minutes, number of calls and the charge for each time of the day (day, evening and night).

In [28]:
hist_data = [df.total_day_minutes.values,
             df.total_eve_minutes.values,
            df.total_night_minutes.values]

group_labels  = ['day', 'evening', 'night']

colors = [BLUE3, ORANGE1, GREEN2]

fig = ff.create_distplot(hist_data, group_labels, 
                         bin_size=10,
                         show_hist=False, 
                         show_rug= False,
                         colors=colors)

fig.update_layout(height=400, width=950,
                  title_text='total minutes',
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='day time',
                  showlegend=True)
fig.show()
  • It can be seen that the distribution of number of minutes in the evening and night are similar with a an 200 minutes average.
  • On the other hand, the total minutes in calls by the day is smaller.
  • Considering the plot below, it can be seen that despite the difference in total minutes, the number of calls is similar in each part of the day.
In [29]:
hist_data = [df.total_day_calls.values,
             df.total_eve_calls.values,
            df.total_night_calls.values]

group_labels  = ['day', 'evening', 'night']

colors = [BLUE3, ORANGE1, GREEN2]

fig = ff.create_distplot(hist_data, group_labels, 
                         bin_size=10,
                         show_hist=False, 
                         show_curve= True,
                         show_rug= False,
                         colors=colors)

fig.update_layout(height=400, width=950,
                  title_text='total calls',
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='day time',
                  showlegend=True)
fig.show()
In [30]:
hist_data = [df.total_day_charge.values,
             df.total_eve_charge.values,
            df.total_night_charge.values]

group_labels  = ['day', 'evening', 'night']

colors = [BLUE3, ORANGE1, GREEN2]

fig = ff.create_distplot(hist_data, group_labels, 
                         bin_size=1,
                         show_hist=True, 
                         show_curve= False,
                         show_rug= False,
                         colors=colors)

fig.update_layout(height=400, width=950,
                  title_text='total charge',
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='day time',
                  showlegend=True)

fig.add_annotation(x=31, y=0.06,
            text= str(round(df.total_day_charge.values.mean(),0 )),
            showarrow=False)

fig.add_annotation(x=17, y=0.11,
            text= str(round(df.total_eve_charge.values.mean(),0 )),
            showarrow=False)

fig.add_annotation(x=9, y=0.18,
            text= str(round(df.total_night_charge.values.mean(),0 )),
            showarrow=False)

fig.show()

Although the number of calls and total call is similar the total amount of charge is very different between each time of the day. By day the price can be around 3.5x more expensive than at nigh.

In [31]:
fig = go.Figure()


fig.add_trace(
    go.Scatter(x=df.total_day_minutes,
               y = df.total_day_charge,
              mode = 'markers',
              marker_color = BLUE3,
               opacity= 0.6,
              name = 'day'))

fig.add_trace(
    go.Scatter(x=df.total_eve_minutes,
               y = df.total_eve_charge,
              mode = 'markers',
              marker_color = ORANGE1,
               opacity= 0.6,
              name = 'evening'))

fig.add_trace(
    go.Scatter(x=df.total_night_minutes,
               y = df.total_night_charge,
              mode = 'markers',
              marker_color = GREEN2,
               opacity= 0.6,
              name = 'night'))

fig['layout']['xaxis']['title']='Total amount of minutes'
fig['layout']['yaxis']['title']='Total charge'

fig.update_layout(height=400, width=900,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_font_size = 18,
                  legend_title_text='day time',
                  showlegend=True)

fig.show()

The amount of charge is directly proportional with the amount of minutes used at each time of the day

Total amount of minutes, charge and calls¶

Lets sum the minutes, charge and number of calls for the entire day.

In [32]:
df['total_minutes'] = df.total_day_minutes + df.total_eve_minutes + df.total_night_minutes
df['total_charge'] = df.total_day_charge + df.total_eve_charge + df.total_night_charge
df['total_calls'] = df.total_day_calls + df.total_eve_calls + df.total_night_calls
df['charge_per_minute'] = df['total_charge'] / df['total_minutes']
df['charge_per_call'] = df['total_charge'] / df['total_calls']

df.head()
Out[32]:
state account_length area_code international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes ... total_intl_minutes total_intl_calls total_intl_charge number_customer_service_calls churn total_minutes total_charge total_calls charge_per_minute charge_per_call
4056 MI 62 408 0 0 0 119.3 94 20.28 224.0 ... 12.1 6 3.27 5 1 500.0 46.37 253 0.092740 0.183281
4017 CO 97 415 0 1 27 125.9 90 21.40 123.4 ... 7.3 7 1.97 0 0 479.3 42.24 252 0.088129 0.167619
2819 VT 60 415 0 0 0 193.9 118 32.96 85.0 ... 13.2 8 3.56 3 0 489.0 49.64 362 0.101513 0.137127
1069 MT 74 415 0 0 0 162.7 102 27.66 292.0 ... 8.7 6 2.35 0 0 638.0 60.73 287 0.095188 0.211603
344 AZ 117 408 0 0 0 239.9 84 40.78 174.8 ... 9.8 2 2.65 0 0 624.2 65.07 283 0.104245 0.229929

5 rows × 25 columns

In [33]:
fig = make_subplots(rows=1, cols=3)

# Total minutes
fig.add_trace(go.Box(x=df[df.churn == 1].total_minutes, 
                     marker_color = ORANGE1,
                     name = 'Yes'),
             row = 1, col = 1)

fig.add_trace(go.Box(x=df[df.churn == 0].total_minutes, 
                     marker_color = GRAY8,
                     name = 'No'),
             row = 1, col = 1)

# Total charge
fig.add_trace(go.Box(x=df[df.churn == 1].total_charge, 
                     marker_color = ORANGE1, 
                     name = ''),
             row = 1, col = 2)

fig.add_trace(go.Box(x=df[df.churn == 0].total_charge, 
                     marker_color = GRAY8, 
                     name = ' '),
             row = 1, col = 2)

# Total calls
fig.add_trace(go.Box(x=df[df.churn == 1].total_calls, 
                     marker_color = ORANGE1, 
                     name = ''),
             row = 1, col = 3)

fig.add_trace(go.Box(x=df[df.churn == 0].total_calls, 
                     marker_color = GRAY8, 
                     name = ' '),
             row = 1, col = 3)

fig['layout']['xaxis1']['title']='total minutes'
fig['layout']['xaxis2']['title']='total charge'
fig['layout']['xaxis3']['title']='total calls'
fig['layout']['yaxis1']['title']='Churn'

fig.update_layout(height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

fig.show()
  • Considering the calls made during all the day, it is possible to see that the number doesn't change between churn and no churn clients.
  • On the other hand, the total amount of minutes, and consequently the total charge is different between clients that are satisfied and the ones who are not.
  • This can lead us think that clients that do require an increased use of company services are more suitable to change the teleco services provider. It can be for different reasons like amount charger, problems with service area, etc.

Number customer service calls¶

As expected the number of customer service calls has a significant difference between satisfied and non-satisfied clients. Clients not satisfied usually do more calls to the customer service. Considering the histplot below it can been seen that satisfied clients call an average of 1 time (considering the median), while the churn clients call 2 times.

In [34]:
group_labels = ['Yes', 'No']

colors = [ ORANGE1, GRAY9]

# Create distplot with curve_type set to 'normal'
fig = ff.create_distplot([df[df.churn == 1].number_customer_service_calls,
                          df[df.churn == 0].number_customer_service_calls], group_labels,
                         #bin_size= 25, 
                         curve_type='normal', # override default 'kde'
                         show_hist = False,
                         colors=colors, 
                         show_rug=False)

fig['layout']['xaxis']['title']='# customer service calls'

# Add title
fig.update_layout(height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                  legend_title_text='Churn',
                  showlegend=True)
fig.show()
In [35]:
fig = go.Figure()

# Total minutes
fig.add_trace(go.Box(x=df[df.churn == 1].number_customer_service_calls, 
                     marker_color = ORANGE1,
                     name = 'Yes'))

fig.add_trace(go.Box(x=df[df.churn == 0].number_customer_service_calls, 
                     marker_color = GRAY8,
                     name = 'No'))

fig['layout']['xaxis1']['title']='# customer service calls'
fig['layout']['yaxis1']['title']='Churn'

fig.update_layout(height=400, width=950,
                  template = 'simple_white',
                  font_family="Arial",
                  font_size = 18,
                  font_color= GRAY6, 
                  title_font_family="Arial Bold", 
                  title_font_size = 25, 
                  title_font_color= GRAY3, 
                  legend_title_font_color=GRAY6,
                 showlegend=False)

Data pre-processing¶

Balance train data¶

In [36]:
y_train3.value_counts()
Out[36]:
0    2920
1     480
Name: churn, dtype: int64

We can see that the churn variable, which we want to predict is unbalanced. Because we have a small amount of data we will upsample the minor category ('yes'/1)

In [37]:
# Get categorical columns indices
cols_category = X_train3.select_dtypes(include=['category'])
cols = X_train3.columns

col_idx = [ind for ind, x in enumerate(cols) if x in cols_category]
In [38]:
X_res, y_res = SMOTENC( categorical_features = col_idx,
                       n_jobs = -1).fit_resample(X_train3, y_train3)
In [39]:
y_res.value_counts()
Out[39]:
0    2920
1    2920
Name: churn, dtype: int64

The train data is now balanced with a total of 5830 rows.

Dummy variables¶

Lets take a closer look to the final dataset to understand how we can perform the feature engineering in order to get the most value from the data. The first thing we can improve is the area_code variable which is a categorical variable with only 3 distinct values (415, 408, 510). Lets create dummy variables.

In [40]:
X_res_dummy = pd.get_dummies(X_res, columns= [ 'area_code', 'state'], drop_first= True)
X_res_dummy.head()
Out[40]:
account_length international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge ... state_SD state_TN state_TX state_UT state_VA state_VT state_WA state_WI state_WV state_WY
0 62 0 0 0 119.3 94 20.28 224.0 81 19.04 ... 0 0 0 0 0 0 0 0 0 0
1 97 0 1 27 125.9 90 21.40 123.4 91 10.49 ... 0 0 0 0 0 0 0 0 0 0
2 60 0 0 0 193.9 118 32.96 85.0 110 7.23 ... 0 0 0 0 0 1 0 0 0 0
3 74 0 0 0 162.7 102 27.66 292.0 105 24.82 ... 0 0 0 0 0 0 0 0 0 0
4 117 0 0 0 239.9 84 40.78 174.8 106 14.86 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 69 columns

Feature Engineering¶

We will build several features to support the machine learning. We will get detailed information on national calls and overall calls cost, time and number.

In [41]:
X_res_fe = X_res_dummy.copy()


# National Calls features
X_res_fe['total_national_calls'] = X_res_dummy.total_day_calls + X_res_dummy.total_eve_calls + X_res_dummy.total_night_calls
X_res_fe['total_national_charge'] = X_res_dummy.total_day_charge + X_res_dummy.total_eve_charge + X_res_dummy.total_night_charge
X_res_fe['total_national_minutes'] = X_res_dummy.total_day_minutes + X_res_dummy.total_eve_minutes + X_res_dummy.total_night_minutes
X_res_fe['national_cost_by_calls'] = X_res_fe['total_national_charge']/X_res_fe['total_national_calls']
X_res_fe['national_cost_by_minutes'] = X_res_fe['total_national_charge']/X_res_fe['total_national_minutes']

# Total Calls features
X_res_fe['total_calls'] = X_res_dummy.total_day_calls + X_res_dummy.total_eve_calls + X_res_dummy.total_night_calls + X_res_dummy.total_intl_calls
X_res_fe['total_charge'] = X_res_dummy.total_day_charge + X_res_dummy.total_eve_charge + X_res_dummy.total_night_charge + X_res_dummy.total_intl_charge
X_res_fe['total_minutes'] = X_res_dummy.total_day_minutes + X_res_dummy.total_eve_minutes + X_res_dummy.total_night_minutes + X_res_dummy.total_intl_minutes
X_res_fe['cost_by_calls'] = X_res_fe['total_charge']/X_res_fe['total_calls']
X_res_fe['cost_by_minutes'] = X_res_fe['total_charge']/X_res_fe['total_minutes']


X_res_fe.head()
Out[41]:
account_length international_plan voice_mail_plan number_vmail_messages total_day_minutes total_day_calls total_day_charge total_eve_minutes total_eve_calls total_eve_charge ... total_national_calls total_national_charge total_national_minutes national_cost_by_calls national_cost_by_minutes total_calls total_charge total_minutes cost_by_calls cost_by_minutes
0 62 0 0 0 119.3 94 20.28 224.0 81 19.04 ... 253 46.37 500.0 0.183281 0.092740 259 49.64 512.1 0.191660 0.096934
1 97 0 1 27 125.9 90 21.40 123.4 91 10.49 ... 252 42.24 479.3 0.167619 0.088129 259 44.21 486.6 0.170695 0.090855
2 60 0 0 0 193.9 118 32.96 85.0 110 7.23 ... 362 49.64 489.0 0.137127 0.101513 370 53.20 502.2 0.143784 0.105934
3 74 0 0 0 162.7 102 27.66 292.0 105 24.82 ... 287 60.73 638.0 0.211603 0.095188 293 63.08 646.7 0.215290 0.097541
4 117 0 0 0 239.9 84 40.78 174.8 106 14.86 ... 283 65.07 624.2 0.229929 0.104245 285 67.72 634.0 0.237614 0.106814

5 rows × 79 columns

Scale numerical features¶

Most common machine learning models require the pre-processing of the data, particularly their normalization. SVM, KNN and others use the distance between instances reason why different scales can complicate models convergence.

In [42]:
# All columns
col_names = X_res_fe.columns

# Numerical features to normalize
col_to_normalize= ['account_length', 'number_vmail_messages', 
                   'total_day_minutes', 'total_day_calls', 'total_day_charge', 
                   'total_eve_minutes', 'total_eve_calls', 'total_eve_charge',
                   'total_night_minutes', 'total_night_calls', 'total_night_charge', 
                   'total_intl_minutes', 'total_intl_calls', 'total_intl_charge', 
                   'number_customer_service_calls',
                   'total_national_calls', 'total_national_charge', 'total_national_minutes', 'national_cost_by_calls', 'national_cost_by_minutes', 
                   'total_calls', 'total_charge', 'total_minutes', 'cost_by_calls', 'cost_by_minutes']

# Non numerical features
col_not_to_normalize = [name for name in col_names if name not in col_to_normalize]

transf = make_column_transformer(
    (StandardScaler(), col_to_normalize), remainder = 'passthrough')

X_final = pd.DataFrame(transf.fit_transform(X_res_fe), columns= col_to_normalize + col_not_to_normalize)

Train Model¶

In order to make churn prediction we will be testing different classification models described below:

  • Support Vector Machine (SVC)
  • Logistic Regression
  • Naive Bayes
  • Random Forest
  • XG Boost

The prediction performance will be evaluated considering the accuracy metric as described in kaggle challenge.

$ Accuracy = \frac{ Number of correct predictions} {Number of samples} $

In [43]:
#models
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
import xgboost as xgb
from xgboost import plot_importance
from sklearn.tree import export_graphviz
import lightgbm as lgb

# metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# Model Tuning
from sklearn.model_selection import RandomizedSearchCV

Feature Engineering Function¶

First we will need to build a function to apply the same feature engineering developed for the train data into the validation and test dataset.

In [44]:
def df_feat_eng(df_input):
    
    raw_data = df_input.copy()
    
    raw_data['area_code'] = raw_data['area_code'].apply(lambda x: x.split('_')[-1])
    
    raw_data["international_plan"] = raw_data["international_plan"].map(dict(yes=1, no=0))
    raw_data["voice_mail_plan"] = raw_data["voice_mail_plan"].map(dict(yes=1, no=0))
    
    for col in ['state', 'area_code', 'international_plan', 'voice_mail_plan']:
        raw_data[col] = raw_data[col].astype('category')
        
    raw_data['total_minutes'] = raw_data.total_day_minutes + raw_data.total_eve_minutes + raw_data.total_night_minutes
    raw_data['total_charge'] = raw_data.total_day_charge + raw_data.total_eve_charge + raw_data.total_night_charge
    raw_data['total_calls'] = raw_data.total_day_calls + raw_data.total_eve_calls + raw_data.total_night_calls
    raw_data['charge_per_minute'] = raw_data['total_charge'] / raw_data['total_minutes']
    raw_data['charge_per_call'] = raw_data['total_charge'] / raw_data['total_calls']
    
    X_raw= pd.get_dummies(raw_data, columns= [ 'area_code', 'state'], drop_first= True)
    
    # National Calls features
    X_raw['total_national_calls'] = X_raw.total_day_calls + X_raw.total_eve_calls + X_raw.total_night_calls
    X_raw['total_national_charge'] = X_raw.total_day_charge + X_raw.total_eve_charge + X_raw.total_night_charge
    X_raw['total_national_minutes'] = X_raw.total_day_minutes + X_raw.total_eve_minutes + X_raw.total_night_minutes
    X_raw['national_cost_by_calls'] = X_raw['total_national_charge']/X_raw['total_national_calls']
    X_raw['national_cost_by_minutes'] = X_raw['total_national_charge']/X_raw['total_national_minutes']

    # Total Calls features
    X_raw['total_calls'] = X_raw.total_day_calls + X_raw.total_eve_calls + X_raw.total_night_calls + X_raw.total_intl_calls
    X_raw['total_charge'] = X_raw.total_day_charge + X_raw.total_eve_charge + X_raw.total_night_charge + X_raw.total_intl_charge
    X_raw['total_minutes'] = X_raw.total_day_minutes + X_raw.total_eve_minutes + X_raw.total_night_minutes + X_raw.total_intl_minutes
    X_raw['cost_by_calls'] = X_raw['total_charge']/X_raw['total_calls']
    X_raw['cost_by_minutes'] = X_raw['total_charge']/X_raw['total_minutes']
    
    X_fe_final = pd.DataFrame(transf.transform(X_raw), columns= col_to_normalize + col_not_to_normalize)

    return X_fe_final

X_test_final = df_feat_eng(X_test)
In [45]:
# QC on columns number
X_test_final.shape[1] == X_final.shape[1]
Out[45]:
True
In [46]:
# QC columns name
sum(X_test_final.columns == X_final.columns) == 79
Out[46]:
True
In [47]:
# QC on columns type
sum(X_test_final.dtypes == X_final.dtypes) == X_final.shape[1]
Out[47]:
True
In [48]:
# Change labels on test and validation to 0 and 1

y_test_res =  y_test.map(dict(yes=1, no=0))
y_val_res =  y_val.map(dict(yes=1, no=0))

Support Vector Machine (SVC)¶

In [49]:
svc = SVC(probability=True)
In [50]:
svc.fit(X_final, y_res)
Out[50]:
SVC(probability=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
SVC(probability=True)
In [51]:
y_pred = svc.predict(X_final)
In [52]:
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[52]:
Pred 0 Pred 1
True 0 2771 149
True 1 322 2598
In [53]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[53]:
0.9193493150684932
In [54]:
# Score on test data
accuracy_score(y_test_res, svc.predict(X_test_final))
Out[54]:
0.8941176470588236

It can be seen that the model overfitted the train data. Considering the SVC model,

In [55]:
svc2 = SVC(probability=True,
           C = 0.1, 
           kernel = 'rbf')

svc2.fit(X_final, y_res)

svc_train_score = round(accuracy_score(y_res     , svc2.predict(X_final     )), 2)
svc_test_score  = round(accuracy_score(y_test_res, svc2.predict(X_test_final)), 2)

print(f'Train score: {svc_train_score}') 
print(f'Test score: {svc_train_score}')     
      
Train score: 0.84
Test score: 0.84

Logistic Regression¶

In [56]:
# Define model
logr = LogisticRegression(solver = 'lbfgs',
                          max_iter= 1000)

# Train model
logr.fit(X_final, y_res)

# Predict with train data
y_pred = logr.predict(X_final)
In [57]:
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[57]:
Pred 0 Pred 1
True 0 2293 627
True 1 626 2294
In [58]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[58]:
0.7854452054794521

Logistic regression presents worse results than SVM, however we will compute the train and test score to input in the final table comparison. We will also mitigate any overfitting that can appear in the training.

In [59]:
logr2 = LogisticRegression(solver = 'lbfgs',
                           C = 0.5,
                           max_iter= 1000)

logr2.fit(X_final, y_res)

logr_train_score = round(accuracy_score(y_res     , logr2.predict(X_final     )), 2)
logr_test_score  = round(accuracy_score(y_test_res, logr2.predict(X_test_final)), 2)

print(f'Train score: {logr_train_score}') 
print(f'Test score: {logr_test_score}')     
Train score: 0.78
Test score: 0.78

Naive Bayes¶

In [60]:
# Define model
gnb = GaussianNB()

# Train model
gnb.fit(X_final, y_res)

# Predict with train data
y_pred = gnb.predict(X_final)
In [61]:
# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[61]:
Pred 0 Pred 1
True 0 1877 1043
True 1 818 2102
In [62]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[62]:
0.6813356164383562
In [63]:
gnb2 = GaussianNB()

gnb2.fit(X_final, y_res)

gnb_train_score = round(accuracy_score(y_res     , gnb2.predict(X_final     )), 2)
gnb_test_score  = round(accuracy_score(y_test_res, gnb2.predict(X_test_final)), 2)

print(f'Train score: {gnb_train_score}') 
print(f'Test score: {gnb_test_score}')     
Train score: 0.68
Test score: 0.62

Random Forest¶

In [64]:
# Define model
rf = RandomForestClassifier(max_depth= 10)

# Train model
rf.fit(X_final, y_res)

# Predict with train data
y_pred = rf.predict(X_final)

# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[64]:
Pred 0 Pred 1
True 0 2906 14
True 1 389 2531
In [65]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[65]:
0.9309931506849315

Lets evaluate the overfitting and the test score.

In [66]:
rf2 = RandomForestClassifier(max_depth= 10)

rf2.fit(X_final, y_res)

rf_train_score = round(accuracy_score(y_res     , rf2.predict(X_final     )), 2)
rf_test_score  = round(accuracy_score(y_test_res, rf2.predict(X_test_final)), 2)

print(f'Train score: {rf_train_score}') 
print(f'Test score: {rf_test_score}') 
Train score: 0.94
Test score: 0.94

Extra Trees Classifier¶

In [67]:
# Define model
extra = ExtraTreesClassifier()

# Train model
extra.fit(X_final, y_res)

# Predict with train data
y_pred = extra.predict(X_final)

# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[67]:
Pred 0 Pred 1
True 0 2920 0
True 1 0 2920
In [68]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[68]:
1.0
In [69]:
extra2 = ExtraTreesClassifier(max_depth=11)

extra2.fit(X_final, y_res)

extra_train_score = round(accuracy_score(y_res     , extra2.predict(X_final     )), 2)
extra_test_score  = round(accuracy_score(y_test_res, extra2.predict(X_test_final)), 2)

print(f'Train score: {extra_train_score}') 
print(f'Test score: {extra_test_score}') 
Train score: 0.9
Test score: 0.92

XGBoost¶

In [70]:
# Supported tree methods are `gpu_hist`, `approx`, and `hist`.
boost = xgb.XGBClassifier(
    tree_method="gpu_hist", 
    objective = 'binary:logistic', #binary:logitraw
    eval_metric = 'auc',
    sampling_method = 'gradient_based',
    n_jobs = -1,
    # Default Parameters
    learning_rate = 0.3,
    max_depth = 6,
    colsample_bytree =1,
    subsample = 1,
    min_split_loss = 0,
    min_child_weight = 1
)

boost.fit(X_final, y_res)

# Predict with train data
y_pred = boost.predict(X_final)

# Confusion matrix
pd.DataFrame(confusion_matrix(y_res, y_pred),
            columns = ['Pred 0', 'Pred 1'],
            index = ['True 0', 'True 1'])
Out[70]:
Pred 0 Pred 1
True 0 2920 0
True 1 1 2919
In [71]:
# Score on train data
accuracy_score(y_res, y_pred)
Out[71]:
0.9998287671232877
In [72]:
boost2 = xgb.XGBClassifier(tree_method="gpu_hist", 
                               objective = 'binary:logistic', #binary:logitraw
                               eval_metric = 'auc',
                               sampling_method = 'uniform', #'gradient_based'
                               n_jobs = -1,
                               # Default Parameters
                               learning_rate = 0.15,
                               max_depth = 5,
                               colsample_bytree =1,
                               subsample = 1,
                               min_split_loss = 1,
                               min_child_weight = 1
                              )

boost2.fit(X_final, y_res)

xgb_train_score = round(accuracy_score(y_res     , boost2.predict(X_final     )), 2)
xgb_test_score  = round(accuracy_score(y_test_res, boost2.predict(X_test_final)), 2)

print(f'Train score: {xgb_train_score}') 
print(f'Test score: {xgb_test_score}') 
Train score: 0.97
Test score: 0.97

Models Resume¶

In [73]:
data = [{'train score': svc_train_score, 'test score': svc_test_score},
       {'train score': logr_train_score, 'test score': logr_test_score},
       {'train score': gnb_train_score, 'test score': gnb_test_score},
       {'train score': rf_train_score, 'test score': rf_test_score},
       {'train score': extra_train_score, 'test score': extra_test_score},
       {'train score': xgb_train_score, 'test score': xgb_test_score}]

df_resume = pd.DataFrame(data, index = ['SVM',
                                       'Logistic Regr.',
                                       'Naive Bayes',
                                       'Random Forest',
                                       'ExtraTrees',
                                       'XGBoost'])

df_resume
Out[73]:
train score test score
SVM 0.84 0.88
Logistic Regr. 0.78 0.78
Naive Bayes 0.68 0.62
Random Forest 0.94 0.94
ExtraTrees 0.90 0.92
XGBoost 0.97 0.97

The XGBoost model presented the best score for both train and test, with small overfitting that can be improved in model tuning. Also, the XGBoost has GPU support which make the model much faster to train, allowing to test more combinations in the model tuning. For both this reason, will be continue with XGBoost only.

Feature Importance¶

In [74]:
fig, ax = plt.subplots(1,1,figsize=(12,10))
plot_importance(boost2, ax=ax)
plt.tight_layout()
plt.show()

By considering the feature importance from the XGBoost model it can be seen that the total evening and international call minutes are the two most important aspects to evaluate if a customer is churn or not. Other feature to consider are account length and total charge.

XGBoost tuning¶

Because cross-validation already considers a train and test set by dividing the entire data set, we will join our train and test data into a single df and perform the feature engineering.

In [75]:
df_tuned = pd.concat([X_train, X_test])

y_tuned = pd.concat([y_train, y_test])

print(X_train.shape[0]+X_test.shape[0] == df_tuned.shape[0])
print(y_train.shape[0]+y_test.shape[0] == y_tuned.shape[0])
True
True
In [76]:
df_tuned_fe = df_feat_eng(df_tuned)

y_res = y_tuned.map(dict(yes=1, no=0))
In [77]:
xgb_estimator = xgb.XGBClassifier(tree_method="gpu_hist", 
                                  objective = 'binary:logistic',
                                  eval_metric = 'auc',
                                  sampling_method = 'gradient_based', 
                                  n_jobs = -1
                                 )
                             
parameters = {
    'eta': [0.001, 0.01, 0.03, 0.05, 0.075, 0.1, 0.3],
    'max_depth': range(2, 7, 1),
    'alpha': np.arange(0, 1, 0.1),
    'lambda': np.arange(0, 1, 0.1)
}
In [78]:
%%time
xgb_grid = RandomizedSearchCV(
    estimator=xgb_estimator,
    param_distributions=parameters,
    n_iter = 1500,
    scoring = 'accuracy',
    n_jobs = -1,
    cv = 8,
    verbose = 3
)

xgb_grid.fit(df_tuned_fe, y_res)
Fitting 8 folds for each of 1500 candidates, totalling 12000 fits
CPU times: total: 55.1 s
Wall time: 1h 23min 50s
Out[78]:
RandomizedSearchCV(cv=8,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, grow_policy=None,
                                           importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate=None, max_bin=None...
                                           n_estimators=100, n_jobs=-1,
                                           num_parallel_tree=None,
                                           predictor=None, random_state=None,
                                           reg_alpha=None, reg_lambda=None, ...),
                   n_iter=1500, n_jobs=-1,
                   param_distributions={'alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'eta': [0.001, 0.01, 0.03, 0.05, 0.075,
                                                0.1, 0.3],
                                        'lambda': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'max_depth': range(2, 7)},
                   scoring='accuracy', verbose=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomizedSearchCV(cv=8,
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           callbacks=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric='auc', gamma=None,
                                           gpu_id=None, grow_policy=None,
                                           importance_type=None,
                                           interaction_constraints=None,
                                           learning_rate=None, max_bin=None...
                                           n_estimators=100, n_jobs=-1,
                                           num_parallel_tree=None,
                                           predictor=None, random_state=None,
                                           reg_alpha=None, reg_lambda=None, ...),
                   n_iter=1500, n_jobs=-1,
                   param_distributions={'alpha': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'eta': [0.001, 0.01, 0.03, 0.05, 0.075,
                                                0.1, 0.3],
                                        'lambda': array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                                        'max_depth': range(2, 7)},
                   scoring='accuracy', verbose=3)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='auc', gamma=None,
              gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=-1,
              num_parallel_tree=None, predictor=None, random_state=None,
              reg_alpha=None, reg_lambda=None, ...)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='auc', gamma=None,
              gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, n_estimators=100, n_jobs=-1,
              num_parallel_tree=None, predictor=None, random_state=None,
              reg_alpha=None, reg_lambda=None, ...)
In [87]:
#save model
import pickle
pkl_filename = 'final_model.pkl'
with open(pkl_filename, 'wb') as file:
    pickle.dump(xgb_grid, file)
In [88]:
#load model
with open(pkl_filename, 'rb') as file:
    xgb_grid = pickle.load(file)
In [79]:
print(f'Best XGBoost Score: {round(xgb_grid.best_score_,3 )}')
print(f'Best XGBoost Parameters: {xgb_grid.best_params_}')
Best XGBoost Score: 0.982
Best XGBoost Parameters: {'max_depth': 3, 'lambda': 0.30000000000000004, 'eta': 0.3, 'alpha': 0.1}

The cross validation gave the best parameters has:

  • max_depth : 3
  • lambda: 0.3
  • eta : 0.3
  • alpha: 0.1

Lets evaluate the final model's score using the validation data set that was not used until now.

In [80]:
X_val_final = df_feat_eng(X_val)

accuracy_score(y_val_res, xgb_grid.predict(X_val_final))
Out[80]:
0.9670588235294117

The final score using the validation dataset was 0.978 which is close to the score we got from the training using crossing validation. We will need to train the model model with the best parameters and will all the data to include as many information as possible before prediction on the final data set that will be submitted to kaggle to get the final score.

In [81]:
# Perform feature engineering on all data
X_final_df = df_feat_eng(X)

y_final = y.map(dict(yes=1, no=0))
In [82]:
# Train model
xgb_best = xgb.XGBClassifier(tree_method="gpu_hist", 
                             objective = 'binary:logistic',
                             eval_metric = 'auc',
                             sampling_method = 'gradient_based', 
                             n_jobs = -1,
                             eta = 0.1,
                             max_depth = 3,
                             reg_lambda = 0.8,
                             alpha = 0.7
                            )

xgb_best.fit(X_final_df, y_final)
Out[82]:
XGBClassifier(alpha=0.7, base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False, eta=0.1,
              eval_metric='auc', gamma=0, gpu_id=0, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.100000001, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBClassifier(alpha=0.7, base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False, eta=0.1,
              eval_metric='auc', gamma=0, gpu_id=0, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.100000001, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=3, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=-1, num_parallel_tree=1, predictor='auto', random_state=0, ...)
In [83]:
# Import new dataset
new_data = pd.read_csv('data/test.csv', delimiter=',')

id_col = new_data.id

new_data = new_data.drop('id', axis = 1)
In [84]:
# Feature Engineering and predict

new_data_fe = df_feat_eng(new_data)

churn = xgb_best.predict(new_data_fe)

churn2 = ['yes' if x == 1 else 'no' for x in churn]
In [85]:
export_df = pd.DataFrame({'id': id_col, 'churn': churn2})

export_df.to_csv('results_2022_08_21_v1.csv', sep = ',', index = False)

Final Score:

  • Private : 0.97142
  • Public: 0.98222

We were able to identify the probable churn clients with a 98 % accuracy. Considering that 85% of the customers on the train that were "no churn" we were able to improve the final prediction.